Link to the github io page: here

Finding the simplest model and feature size for anomaly detection using activation vectors from a pretrained model¶

Name : Kiran Shrestha¶

Convolutional Neural Networks (CNNs) have recently become the main machine learning tool for image datasets. Given an image, the convolution layers are good at learning basic image features like lines, curves and edges using kernels learnt through supervised learning. Each kernel convolves over the image, producing a lower-level (more abstract) representation called a channel.

Here is a simple demo of a 3x3 kernel (in dark blue) convolving over a 5x5 image (in blue) to produce a 3x3 channel (in white).

channel.gif

Usually, we use multiple kernels in each convolution layer, resulting in multiple channels. So, with multiple layers, the channel count becomes large, making our data very high dimensional.
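The 5x5-to-3x3 demo above can be sketched in a few lines of numpy. This is a minimal illustration, not the model's actual convolution: the toy image and averaging kernel are made up, and like CNN libraries it computes cross-correlation (no kernel flip).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a 2-D kernel over an image with no padding ("valid" mode),
    taking an elementwise product-and-sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging kernel
channel = conv2d_valid(image, kernel)
print(channel.shape)  # (3, 3)
```

A 5x5 input with a 3x3 kernel and no padding gives the 3x3 channel shown in the animation; a real layer would learn the kernel weights instead of fixing them.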

In this paper, https://arxiv.org/pdf/2011.08785.pdf, I was first introduced to the idea of using random subsets of channels while still obtaining performance similar to using all the channels. In the paper, the authors try to localize all the anomalous regions in an image. Although the scope of our project is just binary classification, I am still really curious whether I could use a data science approach to pick the best subsets, or at least reduce the overall data size.

About the data¶

Before I explain our dataframe, this is what the raw images look like.

link to raw images

This is a normal transistor image, referred to as a "normal image" in this notebook.

normal_img.png

These are transistor images with anomalies, referred to as "anomaly images" in this notebook.

anomaly_imgs.png

We applied image augmentation techniques to generate more images. The main task here is to classify if a given image is normal or anomaly.

However, we are not using these raw images directly in our data science project. Instead, we pass each image through a CNN and use the channels as our unit of data. To be more specific, we use a pretrained ResNet model and take the channel output of its third layer as our data.

Below is the ResNet architecture; we can observe that layer 3 has an output of shape 14x14x256, where each channel is of size 14x14.

resnet%20layers.png

To reduce the channel dimension, we average each channel (global average pooling) so the final shape is 1x256. Hence, each image is represented by a vector of shape 1x256.

GAP.png
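The pooling step above is one line of numpy. This sketch uses a random array as a stand-in for a real layer-3 feature map:

```python
import numpy as np

# Stand-in for the layer-3 output of one image: 256 channels of size 14x14
feature_map = np.random.rand(256, 14, 14)

# Global average pooling: collapse each 14x14 channel to its mean
gap_vector = feature_map.mean(axis=(1, 2))
print(gap_vector.shape)  # (256,)
```

Stacking these vectors for all images gives the (n_images, 256) matrix used throughout the rest of the notebook.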

We will use a data science approach to figure out how to select channels while maintaining performance similar to using all the channels on the classification task. We can also look for the dimensionality reduction that works best when the model is a simple logistic regression.

How is this a data science problem?¶

  • What is the data? Here a unit of data is one (1x256) row.
  • What is EDA? My EDA mostly consists of performance comparisons.
  • What is the variable? There are 256 features (channels) as variables.
  • What is the model? I mostly use neural networks. However, I plan to use logistic regression and a simple outlier detection technique as well.

Import libraries¶

In [1]:
import os
import random
from random import sample

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

from scipy.spatial.distance import mahalanobis

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import (ConfusionMatrixDisplay, precision_score,
                             recall_score, f1_score)
from sklearn.covariance import LedoitWolf

import warnings
warnings.filterwarnings('ignore')

Load data¶

I have stored two numpy arrays, one each for the normal and anomaly images.

In [2]:
normal_data = np.load("normal.npy")
anomaly_data = np.load("anomaly.npy")

Convert into pandas¶

In [3]:
# Get the column names
cols = ["c" + str(i) for i in range(256)]
# Create pandas dataframe
df = pd.DataFrame(np.concatenate((normal_data, anomaly_data), axis=0), columns = cols)

# Assign label = 0 for normal images  and label = 1 for anomaly
df['label'] = [0] * normal_data.shape[0] + [1] * anomaly_data.shape[0]
In [4]:
df.head()
Out[4]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c247 c248 c249 c250 c251 c252 c253 c254 c255 label
0 0.158984 0.056839 0.081127 0.046833 0.121052 0.127416 0.058033 0.105025 0.024803 0.038923 ... 0.064348 0.154216 0.101745 0.062601 0.105468 0.301576 0.097783 0.187822 0.100279 0
1 0.148324 0.037281 0.087455 0.035361 0.105669 0.196064 0.063659 0.178187 0.026070 0.009796 ... 0.072462 0.113702 0.164750 0.080955 0.066059 0.300927 0.104027 0.174209 0.047375 0
2 0.162493 0.054301 0.053285 0.055391 0.119270 0.128621 0.080290 0.107885 0.030830 0.031822 ... 0.060907 0.144773 0.124535 0.051645 0.102880 0.312054 0.080980 0.152039 0.118316 0
3 0.158661 0.073720 0.071030 0.039224 0.101606 0.083979 0.060536 0.126996 0.035288 0.026823 ... 0.045971 0.128997 0.078799 0.059057 0.091431 0.327879 0.086462 0.174518 0.107508 0
4 0.150962 0.048356 0.060710 0.064295 0.136142 0.139862 0.080558 0.127039 0.027589 0.024897 ... 0.059857 0.135180 0.113146 0.073884 0.080941 0.390073 0.075463 0.166586 0.076868 0

5 rows × 257 columns

In [5]:
df.describe()
Out[5]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c247 c248 c249 c250 c251 c252 c253 c254 c255 label
count 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 ... 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000 1805.000000
mean 0.157616 0.054395 0.085359 0.054675 0.115474 0.124441 0.098844 0.143318 0.034238 0.037625 ... 0.092088 0.096349 0.124493 0.083080 0.087279 0.364936 0.100797 0.188747 0.095412 0.243767
std 0.028883 0.016657 0.034048 0.027661 0.022775 0.019726 0.038861 0.038466 0.012033 0.028445 ... 0.043415 0.045875 0.027045 0.027615 0.013125 0.053219 0.023355 0.037013 0.019430 0.429473
min 0.060056 0.000407 0.023670 0.005960 0.016403 0.061239 0.039680 0.076324 0.000312 0.000411 ... 0.018627 0.001576 0.046374 0.014374 0.037666 0.216784 0.018390 0.105922 0.036836 0.000000
25% 0.137777 0.043657 0.064352 0.038185 0.104442 0.112087 0.075920 0.113100 0.025964 0.021286 ... 0.060066 0.058490 0.104350 0.063856 0.078192 0.330996 0.085766 0.165368 0.082994 0.000000
50% 0.161338 0.051488 0.073105 0.047939 0.118482 0.122699 0.090073 0.130767 0.033380 0.028622 ... 0.073684 0.108611 0.119653 0.076323 0.087256 0.367339 0.099524 0.179535 0.093917 0.000000
75% 0.178718 0.062503 0.088412 0.060845 0.131367 0.134876 0.105640 0.170695 0.041085 0.038684 ... 0.123349 0.134120 0.144549 0.094636 0.096132 0.391594 0.115847 0.202120 0.106301 0.000000
max 0.254942 0.121560 0.203937 0.193614 0.174762 0.205209 0.323324 0.288091 0.091181 0.168471 ... 0.258239 0.211719 0.245009 0.190859 0.132892 0.745504 0.186764 0.321174 0.198278 1.000000

8 rows × 257 columns

In [6]:
df['label'].value_counts()
Out[6]:
0    1365
1     440
Name: label, dtype: int64

So, there is a small class imbalance here, but it should not be a big issue.

Visualization¶

We first check whether the columns are correlated, since correlation lets us use linear dimensionality reduction techniques like PCA, or even simply remove columns with high correlations.

In [7]:
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number, cmap = "coolwarm")
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);

2 hidden layers neural network model¶

We first try a neural network, since it usually performs better on high-dimensional data.

In [8]:
# Function to split the data, with or without scaling

def get_train_test_split(X = None, y = None, rand_int = 42, scale = False):
    
    if X is None and y is None:
        X = np.concatenate((normal_data, anomaly_data), axis=0)
        y = [0] * normal_data.shape[0] + [1] * anomaly_data.shape[0]


    X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                        stratify=y,
                                                        shuffle = True,
                                                        random_state = rand_int,
                                                        test_size=0.2)
    if scale:
        scaler = StandardScaler()
        scaler.fit(X_train)
        X_train_scaled = scaler.transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        return  X_train_scaled, X_test_scaled, y_train, y_test
    else:
        return X_train, X_test, y_train, y_test

Use all the columns

We train a model with all features and find average train and test score.

In [9]:
# List to store the scores
scores = []
# Train and test data
X_train, X_test, y_train, y_test = get_train_test_split()

for i in range(10):
    nn = MLPClassifier(hidden_layer_sizes= (20,20), alpha = 0.01,max_iter=1000)
    # Train the model
    nn.fit(X_train, y_train)
    # Train score
    train_score = nn.score(X_train, y_train)
    # test score
    test_score = nn.score(X_test, y_test)
    scores.append([train_score,test_score ])

# Find mean and standard deviation of the scores
mean = np.mean(scores, axis=0)
stand_dev = np.std(scores, axis=0, ddof=1)

print("mean score is ", mean)
print("standard deviation is ", stand_dev)
mean score is  [0.9367036  0.89944598]
standard deviation is  [0.01554774 0.01197169]

Use 80 random columns

Here, we randomly select 80 columns and check how the performance changes.

In [10]:
def random_NN(X_train, y_train, X_test, y_test, rand_columns_count, hidden_layer_sizes = (20,20), loop_count = 10):
    scores = []
    for i in range(loop_count):
        col_count = X_train.shape[1]
        rand_cols = random.sample(range(col_count),rand_columns_count)
        rand_cols = sorted(rand_cols)

        nn = MLPClassifier(hidden_layer_sizes = hidden_layer_sizes, alpha = 0.001,max_iter=1000)
        # Train the model
        nn.fit(X_train[:, rand_cols], y_train)
        # Train score
        train_score = nn.score(X_train[:, rand_cols], y_train)
        # test score
        test_score = nn.score(X_test[:, rand_cols], y_test)
        scores.append([train_score,test_score ])

    # Find mean and standard deviation of the scores
    mean = np.mean(scores, axis=0)
    stand_dev = np.std(scores, axis=0, ddof=1)
    return scores, mean, stand_dev
In [11]:
scores, mean, stand_dev = random_NN(X_train, y_train, X_test, y_test, 
                                    rand_columns_count = 80, hidden_layer_sizes = (20,20), loop_count = 10)

print("mean score is ", mean)
print("standard deviation is ", stand_dev)
scores
mean score is  [0.90547091 0.87534626]
standard deviation is  [0.00864496 0.01905796]
Out[11]:
[[0.9023545706371191, 0.8337950138504155],
 [0.9092797783933518, 0.8836565096952909],
 [0.9030470914127424, 0.8753462603878116],
 [0.8981994459833795, 0.8670360110803325],
 [0.9127423822714681, 0.8919667590027701],
 [0.8878116343490304, 0.8698060941828255],
 [0.9120498614958449, 0.8614958448753463],
 [0.917590027700831, 0.8781163434903048],
 [0.9016620498614959, 0.8919667590027701],
 [0.909972299168975, 0.9002770083102493]]

These are the train and test scores for each run.

What's the point of this comparison?

As we can see, there is not much difference in the score with just 80 columns. That means there may be a way to determine the minimum number of columns that still gives reasonable performance with the simplest classification model. Here, we used a neural net with 2 hidden layers since neural nets perform well on high-dimensional data. Do you think we can do better?

Feature selection using correlation¶

If two or more columns are highly correlated, we can keep one of them and remove the rest.

In [12]:
# This function drops columns with high correlation

def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than the threshold. Removing collinear features can help a model 
        to generalize and improves the interpretability of the model.

    Inputs: 
        x: features dataframe
        threshold: features with correlations greater than this value are removed

    Output: 
        dataframe that contains only the non-highly-collinear features
    '''

    # Calculate the correlation matrix
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterate through the correlation matrix and compare correlations
    for i in iters:
        for j in range(i+1):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)

            # If correlation exceeds the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])

    # Drop one of each pair of correlated columns
    drops = set(drop_cols)
    x = x.drop(columns=drops)

    return x
In [13]:
# Remove features with collinearity above 0.7 to prevent overfitting
df_new = remove_collinear_features(df.iloc[:,0:-1], 0.7)
c3 | c2 | 0.72
c9 | c2 | 0.81
c9 | c3 | 0.8
c14 | c3 | 0.71
c14 | c9 | 0.77
c20 | c2 | 0.7
c20 | c3 | 0.72
c20 | c9 | 0.76
c24 | c19 | 0.75
c25 | c3 | 0.72
c25 | c9 | 0.77
c25 | c14 | 0.73
c25 | c20 | 0.76
c30 | c2 | 0.73
c30 | c9 | 0.73
c30 | c16 | 0.74
c31 | c7 | 0.72
c40 | c19 | 0.71
c41 | c25 | 0.71
c43 | c34 | 0.75
c57 | c6 | 0.74
c57 | c43 | 0.7
c58 | c19 | 0.77
c59 | c6 | 0.72
c59 | c18 | 0.73
c59 | c43 | 0.78
c59 | c54 | 0.74
c60 | c2 | 0.74
c60 | c21 | 0.75
c61 | c43 | 0.72
c61 | c59 | 0.72
c62 | c9 | 0.7
c63 | c2 | 0.71
c63 | c9 | 0.8
c67 | c63 | 0.74
c70 | c2 | 0.82
c70 | c3 | 0.8
c70 | c9 | 0.87
c70 | c30 | 0.77
c70 | c63 | 0.76
c70 | c64 | 0.73
c72 | c7 | 0.73
c72 | c43 | 0.76
c72 | c59 | 0.77
c72 | c70 | 0.71
c73 | c59 | 0.79
c77 | c9 | 0.74
c80 | c9 | 0.73
c80 | c16 | 0.72
c80 | c30 | 0.71
c80 | c70 | 0.73
c89 | c26 | 0.72
c92 | c67 | 0.75
c92 | c91 | 0.76
c95 | c2 | 0.79
c95 | c9 | 0.74
c95 | c30 | 0.74
c95 | c70 | 0.81
c104 | c2 | 0.84
c104 | c9 | 0.74
c104 | c28 | 0.71
c104 | c30 | 0.71
c104 | c70 | 0.75
c104 | c88 | 0.71
c104 | c92 | 0.73
c104 | c95 | 0.73
c107 | c58 | 0.71
c107 | c103 | 0.74
c109 | c2 | 0.78
c109 | c3 | 0.78
c109 | c9 | 0.82
c109 | c14 | 0.74
c109 | c20 | 0.71
c109 | c25 | 0.74
c109 | c60 | 0.85
c109 | c62 | 0.71
c109 | c70 | 0.75
c110 | c109 | 0.72
c111 | c7 | 0.72
c111 | c28 | 0.78
c111 | c104 | 0.75
c113 | c2 | 0.72
c113 | c9 | 0.73
c113 | c30 | 0.74
c113 | c63 | 0.71
c113 | c70 | 0.72
c121 | c43 | 0.72
c121 | c59 | 0.7
c121 | c94 | 0.7
c125 | c2 | 0.74
c125 | c9 | 0.74
c125 | c64 | 0.74
c125 | c70 | 0.78
c125 | c72 | 0.76
c125 | c95 | 0.76
c125 | c109 | 0.74
c129 | c58 | 0.72
c131 | c6 | 0.79
c131 | c43 | 0.73
c131 | c57 | 0.76
c131 | c59 | 0.81
c131 | c121 | 0.76
c133 | c6 | 0.75
c133 | c43 | 0.75
c133 | c59 | 0.79
c133 | c131 | 0.79
c134 | c91 | 0.74
c134 | c107 | 0.75
c140 | c30 | 0.73
c140 | c70 | 0.7
c143 | c3 | 0.71
c143 | c9 | 0.77
c143 | c14 | 0.74
c143 | c20 | 0.71
c143 | c25 | 0.78
c143 | c70 | 0.71
c143 | c109 | 0.75
c144 | c2 | 0.84
c144 | c9 | 0.76
c144 | c70 | 0.74
c144 | c95 | 0.71
c144 | c104 | 0.8
c144 | c109 | 0.71
c147 | c7 | 0.83
c147 | c28 | 0.7
c147 | c43 | 0.74
c147 | c59 | 0.81
c147 | c61 | 0.71
c147 | c67 | 0.83
c147 | c70 | 0.77
c147 | c72 | 0.84
c147 | c80 | 0.7
c147 | c92 | 0.75
c147 | c95 | 0.72
c147 | c104 | 0.76
c147 | c111 | 0.74
c147 | c124 | 0.73
c147 | c125 | 0.79
c147 | c133 | 0.7
c154 | c0 | 0.73
c154 | c7 | 0.72
c154 | c28 | 0.86
c154 | c72 | 0.74
c154 | c104 | 0.73
c154 | c111 | 0.79
c154 | c147 | 0.81
c155 | c147 | 0.73
c158 | c7 | 0.81
c158 | c67 | 0.71
c158 | c91 | 0.73
c158 | c92 | 0.72
c158 | c104 | 0.7
c158 | c111 | 0.7
c158 | c125 | 0.71
c158 | c147 | 0.85
c158 | c154 | 0.75
c162 | c2 | 0.79
c162 | c9 | 0.76
c162 | c70 | 0.73
c162 | c104 | 0.78
c162 | c109 | 0.73
c162 | c143 | 0.73
c162 | c144 | 0.74
c163 | c21 | 0.71
c163 | c60 | 0.73
c163 | c109 | 0.72
c165 | c43 | 0.75
c165 | c59 | 0.78
c165 | c72 | 0.73
c165 | c73 | 0.71
c165 | c109 | 0.71
c165 | c121 | 0.81
c165 | c125 | 0.77
c165 | c131 | 0.72
c165 | c147 | 0.75
c167 | c7 | 0.76
c167 | c31 | 0.74
c167 | c147 | 0.76
c169 | c16 | 0.7
c169 | c30 | 0.79
c169 | c104 | 0.72
c172 | c7 | 0.71
c172 | c72 | 0.7
c172 | c147 | 0.76
c172 | c154 | 0.7
c172 | c158 | 0.71
c175 | c2 | 0.78
c175 | c3 | 0.74
c175 | c9 | 0.72
c175 | c20 | 0.73
c175 | c30 | 0.73
c175 | c104 | 0.71
c175 | c162 | 0.74
c176 | c58 | 0.72
c176 | c151 | 0.71
c178 | c2 | 0.82
c178 | c3 | 0.76
c178 | c9 | 0.85
c178 | c20 | 0.71
c178 | c25 | 0.73
c178 | c30 | 0.8
c178 | c63 | 0.81
c178 | c67 | 0.76
c178 | c70 | 0.85
c178 | c80 | 0.74
c178 | c92 | 0.7
c178 | c95 | 0.76
c178 | c104 | 0.8
c178 | c109 | 0.75
c178 | c113 | 0.73
c178 | c125 | 0.75
c178 | c140 | 0.72
c178 | c144 | 0.81
c178 | c147 | 0.75
c178 | c169 | 0.72
c178 | c175 | 0.71
c179 | c2 | 0.7
c179 | c64 | 0.76
c179 | c70 | 0.72
c179 | c72 | 0.78
c179 | c88 | 0.7
c179 | c104 | 0.7
c179 | c125 | 0.8
c179 | c147 | 0.82
c179 | c154 | 0.75
c179 | c155 | 0.7
c179 | c158 | 0.8
c179 | c172 | 0.71
c180 | c7 | 0.71
c180 | c67 | 0.73
c180 | c147 | 0.79
c180 | c158 | 0.76
c180 | c179 | 0.72
c182 | c2 | 0.74
c182 | c9 | 0.81
c182 | c43 | 0.73
c182 | c59 | 0.72
c182 | c60 | 0.74
c182 | c61 | 0.71
c182 | c62 | 0.72
c182 | c63 | 0.74
c182 | c70 | 0.78
c182 | c72 | 0.71
c182 | c109 | 0.81
c182 | c110 | 0.71
c182 | c121 | 0.75
c182 | c125 | 0.8
c182 | c144 | 0.72
c182 | c147 | 0.78
c182 | c165 | 0.81
c182 | c178 | 0.79
c185 | c19 | 0.74
c185 | c58 | 0.7
c185 | c129 | 0.75
c185 | c181 | 0.71
c187 | c2 | 0.88
c187 | c3 | 0.71
c187 | c9 | 0.77
c187 | c64 | 0.75
c187 | c70 | 0.82
c187 | c95 | 0.75
c187 | c104 | 0.81
c187 | c109 | 0.73
c187 | c113 | 0.73
c187 | c125 | 0.77
c187 | c144 | 0.73
c187 | c147 | 0.73
c187 | c162 | 0.81
c187 | c164 | 0.76
c187 | c175 | 0.84
c187 | c178 | 0.76
c187 | c179 | 0.78
c187 | c182 | 0.72
c190 | c2 | 0.8
c190 | c9 | 0.84
c190 | c14 | 0.78
c190 | c43 | 0.77
c190 | c50 | 0.71
c190 | c59 | 0.76
c190 | c60 | 0.74
c190 | c61 | 0.77
c190 | c62 | 0.78
c190 | c63 | 0.73
c190 | c70 | 0.79
c190 | c72 | 0.72
c190 | c73 | 0.73
c190 | c104 | 0.73
c190 | c109 | 0.86
c190 | c121 | 0.71
c190 | c125 | 0.78
c190 | c133 | 0.7
c190 | c143 | 0.74
c190 | c144 | 0.74
c190 | c147 | 0.77
c190 | c162 | 0.74
c190 | c165 | 0.78
c190 | c178 | 0.75
c190 | c182 | 0.83
c190 | c187 | 0.8
c191 | c9 | 0.72
c191 | c190 | 0.72
c192 | c23 | 0.72
c192 | c43 | 0.77
c198 | c188 | 0.72
c201 | c9 | 0.71
c201 | c43 | 0.79
c201 | c59 | 0.75
c201 | c61 | 0.72
c201 | c70 | 0.76
c201 | c72 | 0.77
c201 | c125 | 0.72
c201 | c147 | 0.78
c201 | c165 | 0.7
c201 | c182 | 0.75
c201 | c190 | 0.77
c202 | c28 | 0.73
c202 | c59 | 0.71
c202 | c72 | 0.74
c202 | c147 | 0.79
c202 | c154 | 0.8
c202 | c158 | 0.71
c202 | c179 | 0.7
c205 | c6 | 0.76
c205 | c14 | 0.74
c205 | c43 | 0.82
c205 | c54 | 0.72
c205 | c57 | 0.71
c205 | c59 | 0.87
c205 | c61 | 0.75
c205 | c67 | 0.72
c205 | c72 | 0.72
c205 | c73 | 0.76
c205 | c121 | 0.77
c205 | c131 | 0.84
c205 | c133 | 0.83
c205 | c147 | 0.8
c205 | c165 | 0.79
c205 | c182 | 0.74
c205 | c190 | 0.81
c205 | c192 | 0.7
c205 | c201 | 0.77
c207 | c9 | 0.72
c208 | c2 | 0.75
c208 | c7 | 0.73
c208 | c9 | 0.74
c208 | c43 | 0.76
c208 | c59 | 0.76
c208 | c61 | 0.72
c208 | c67 | 0.75
c208 | c70 | 0.81
c208 | c72 | 0.8
c208 | c104 | 0.77
c208 | c125 | 0.78
c208 | c147 | 0.91
c208 | c154 | 0.77
c208 | c158 | 0.79
c208 | c165 | 0.79
c208 | c172 | 0.72
c208 | c178 | 0.77
c208 | c179 | 0.78
c208 | c180 | 0.71
c208 | c182 | 0.81
c208 | c187 | 0.79
c208 | c190 | 0.82
c208 | c201 | 0.78
c208 | c202 | 0.79
c208 | c205 | 0.8
c209 | c9 | 0.71
c209 | c59 | 0.77
c209 | c70 | 0.76
c209 | c73 | 0.79
c209 | c109 | 0.72
c209 | c121 | 0.71
c209 | c125 | 0.74
c209 | c147 | 0.74
c209 | c165 | 0.78
c209 | c182 | 0.75
c209 | c190 | 0.8
c209 | c205 | 0.79
c209 | c208 | 0.76
c213 | c13 | 0.75
c213 | c64 | 0.72
c218 | c176 | 0.77
c218 | c185 | 0.71
c226 | c109 | 0.74
c226 | c190 | 0.74
c226 | c205 | 0.71
c228 | c19 | 0.73
c228 | c58 | 0.72
c228 | c218 | 0.73
c229 | c80 | 0.71
c230 | c6 | 0.75
c230 | c7 | 0.78
c230 | c9 | 0.71
c230 | c43 | 0.85
c230 | c54 | 0.72
c230 | c59 | 0.89
c230 | c61 | 0.75
c230 | c62 | 0.71
c230 | c63 | 0.72
c230 | c67 | 0.79
c230 | c70 | 0.76
c230 | c72 | 0.84
c230 | c73 | 0.76
c230 | c121 | 0.74
c230 | c125 | 0.77
c230 | c131 | 0.81
c230 | c133 | 0.82
c230 | c147 | 0.9
c230 | c154 | 0.73
c230 | c158 | 0.77
c230 | c165 | 0.84
c230 | c172 | 0.76
c230 | c178 | 0.76
c230 | c179 | 0.78
c230 | c180 | 0.77
c230 | c182 | 0.8
c230 | c187 | 0.74
c230 | c190 | 0.82
c230 | c192 | 0.72
c230 | c201 | 0.83
c230 | c202 | 0.77
c230 | c205 | 0.89
c230 | c208 | 0.89
c230 | c209 | 0.78
c236 | c88 | 0.75
c236 | c103 | 0.73
c236 | c106 | 0.72
c236 | c107 | 0.71
c236 | c147 | 0.74
c236 | c158 | 0.78
c236 | c179 | 0.77
c240 | c2 | 0.78
c240 | c9 | 0.79
c240 | c60 | 0.83
c240 | c62 | 0.75
c240 | c63 | 0.73
c240 | c70 | 0.73
c240 | c109 | 0.87
c240 | c125 | 0.72
c240 | c144 | 0.73
c240 | c163 | 0.71
c240 | c165 | 0.7
c240 | c178 | 0.73
c240 | c182 | 0.81
c240 | c190 | 0.83
c240 | c208 | 0.72
c240 | c209 | 0.71
c240 | c226 | 0.74
c242 | c2 | 0.8
c242 | c64 | 0.82
c242 | c70 | 0.74
c242 | c104 | 0.78
c242 | c162 | 0.79
c242 | c175 | 0.8
c242 | c179 | 0.71
c242 | c187 | 0.87
c244 | c7 | 0.75
c244 | c91 | 0.73
c244 | c103 | 0.71
c244 | c107 | 0.7
c244 | c111 | 0.71
c244 | c147 | 0.78
c244 | c158 | 0.79
c244 | c179 | 0.75
c244 | c236 | 0.81
c245 | c2 | 0.76
c245 | c7 | 0.72
c245 | c9 | 0.8
c245 | c43 | 0.75
c245 | c59 | 0.79
c245 | c62 | 0.73
c245 | c63 | 0.74
c245 | c64 | 0.73
c245 | c67 | 0.79
c245 | c70 | 0.83
c245 | c72 | 0.78
c245 | c73 | 0.71
c245 | c80 | 0.72
c245 | c92 | 0.71
c245 | c95 | 0.73
c245 | c104 | 0.75
c245 | c109 | 0.72
c245 | c110 | 0.72
c245 | c113 | 0.76
c245 | c125 | 0.81
c245 | c133 | 0.72
c245 | c144 | 0.74
c245 | c147 | 0.86
c245 | c155 | 0.7
c245 | c158 | 0.73
c245 | c165 | 0.8
c245 | c172 | 0.72
c245 | c178 | 0.82
c245 | c179 | 0.81
c245 | c180 | 0.7
c245 | c182 | 0.81
c245 | c187 | 0.85
c245 | c190 | 0.83
c245 | c201 | 0.79
c245 | c202 | 0.71
c245 | c205 | 0.8
c245 | c207 | 0.73
c245 | c208 | 0.86
c245 | c209 | 0.76
c245 | c230 | 0.91
c245 | c240 | 0.7
c246 | c92 | 0.73
c246 | c111 | 0.72
c246 | c147 | 0.77
c246 | c158 | 0.72
c247 | c7 | 0.82
c247 | c67 | 0.76
c247 | c72 | 0.73
c247 | c147 | 0.86
c247 | c154 | 0.75
c247 | c158 | 0.84
c247 | c167 | 0.72
c247 | c172 | 0.7
c247 | c179 | 0.74
c247 | c180 | 0.74
c247 | c202 | 0.7
c247 | c208 | 0.79
c247 | c230 | 0.84
c247 | c245 | 0.72
c248 | c2 | 0.7
c248 | c7 | 0.77
c248 | c28 | 0.78
c248 | c72 | 0.72
c248 | c88 | 0.75
c248 | c91 | 0.7
c248 | c92 | 0.75
c248 | c104 | 0.78
c248 | c111 | 0.8
c248 | c147 | 0.82
c248 | c154 | 0.78
c248 | c158 | 0.8
c248 | c179 | 0.76
c248 | c187 | 0.74
c248 | c208 | 0.75
c248 | c236 | 0.74
c248 | c242 | 0.71
c248 | c244 | 0.77
c248 | c245 | 0.71
c248 | c246 | 0.78
c248 | c247 | 0.73
c249 | c111 | 0.74
c250 | c147 | 0.76
c250 | c190 | 0.73
c250 | c205 | 0.74
c250 | c208 | 0.73
c250 | c209 | 0.73
c250 | c230 | 0.74
c254 | c208 | 0.74
c254 | c209 | 0.72
c254 | c230 | 0.79
c254 | c245 | 0.75
In [14]:
df_new.shape
Out[14]:
(1805, 169)

So, we went from 256 features to 169 features.

Train test split¶

In [15]:
X, y = df_new.to_numpy(), df.iloc[:,-1].to_numpy()

X_train, X_test, y_train, y_test = get_train_test_split(X = X, y = y, scale = True)

Cross Validation space

In [16]:
param_grid={'hidden_layer_sizes':[(10,10,10),(15,15,15),(20,20,20),(15,15), (20,20)], 
             "alpha": [0.00001,0.0001,0.001,0.01]} 

# Create the model to be tuned
n_iter = 30 
folds = 8
nn_base = MLPClassifier(max_iter=1000)

# Create the randomized search for the MLP
nn_random = RandomizedSearchCV(estimator = nn_base, param_distributions = param_grid,
                               n_iter = n_iter, cv = folds, verbose = 2, random_state = 42, 
                               n_jobs = -1)

# Create the grid search
nn_grid = GridSearchCV(estimator = nn_base, param_grid = param_grid,
                       cv = folds, verbose = 2, n_jobs = -1)
In [17]:
# Shows the confusion matrix

def confusion(model,x,y):
    pred = model.predict(x)
    fig, ax = plt.subplots(figsize=(10, 5))
    ConfusionMatrixDisplay.from_predictions(y, pred, ax=ax)
    ax.xaxis.set_ticklabels(["normal","anomaly"])
    ax.yaxis.set_ticklabels(["normal","anomaly"])
    _ = ax.set_title(
        f"Confusion Matrix for {model.__class__.__name__}"
    )

Randomized¶

In [18]:
# Fit the random search model
nn_random.fit(X_train, y_train)
print("Best train score ", nn_random.best_score_)
print("Test score ", nn_random.score(X_test, y_test))
      
# View the best parameters from the random search
print(nn_random.best_params_)
nn_model = nn_random.best_estimator_
nn_model.fit(X_train, y_train)

print("test score after refitting", nn_model.score(X_test, y_test))
Fitting 8 folds for each of 20 candidates, totalling 160 fits
Best train score  0.9411487108655617
Test score  0.9362880886426593
{'hidden_layer_sizes': (20, 20), 'alpha': 1e-05}
test score after refitting 0.9362880886426593
In [19]:
confusion(nn_model,X_test,y_test)

Grid search Cv¶

In [20]:
# Fit the grid search model
nn_grid.fit(X_train, y_train)
print("Best train score ", nn_grid.best_score_)
print("Test score ", nn_grid.score(X_test, y_test))
      
# View the best parameters from the grid search
print(nn_grid.best_params_)
nn_model = nn_grid.best_estimator_
nn_model.fit(X_train, y_train)

print("test score after refitting", nn_model.score(X_test, y_test))
Fitting 8 folds for each of 20 candidates, totalling 160 fits
Best train score  0.9425145794966237
Test score  0.9362880886426593
{'alpha': 0.0001, 'hidden_layer_sizes': (20, 20)}
test score after refitting 0.9307479224376731
In [21]:
confusion(nn_model,X_test,y_test)

Instead of using correlation, what if we use 169 random columns, to compare which does better?

In [22]:
scores, mean, stand_dev = random_NN(X_train, y_train, X_test, y_test, 
                                    rand_columns_count = 169, hidden_layer_sizes = (20,20), loop_count = 10)

print("mean score is ", mean)
print("standard deviation is ", stand_dev)
scores
mean score is  [1.         0.93379501]
standard deviation is  [0.         0.00936654]
Out[22]:
[[1.0, 0.9418282548476454],
 [1.0, 0.9307479224376731],
 [1.0, 0.9390581717451524],
 [1.0, 0.9335180055401662],
 [1.0, 0.9113573407202216],
 [1.0, 0.9445983379501385],
 [1.0, 0.9362880886426593],
 [1.0, 0.9390581717451524],
 [1.0, 0.9279778393351801],
 [1.0, 0.9335180055401662]]

Correlation-based selection gives a test score of 93%, while random selection also gives an average test score of 93%. Random selection even scores higher on the training data! So, we need to look for better approaches, and we need to reduce further below 169 columns. Let's see if we can do better!

Using principal component analysis (PCA)¶

PCA is a dimensionality reduction technique. It reduces the number of variables while trying to preserve as much information as possible. Using eigenvectors and eigenvalues, it determines the principal components, which are uncorrelated, and compresses as much information as possible into the first components.

Earlier, in the correlation heatmap, we found that the columns are correlated with each other. Hence, we can use PCA to transform the data.
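The eigenvector/eigenvalue description above can be made concrete: the variances that PCA reports along its principal axes are exactly the eigenvalues of the data's covariance matrix. A minimal sketch on synthetic correlated data (the toy matrix is an assumption, not our feature data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # correlated toy data
Xc = X - X.mean(axis=0)                                  # center the columns

# Eigenvalues of the sample covariance matrix, sorted descending
cov = np.cov(Xc, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]

# sklearn's PCA recovers the same variances along the principal axes
pca = PCA().fit(X)
print(np.allclose(eigvals, pca.explained_variance_))  # True
```

Passing a float like `PCA(0.95)`, as in the next cell, simply keeps the smallest number of leading components whose eigenvalues cover 95% of the total variance.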

Find the best number of components¶

In [23]:
# Train test split with standard scaling
X_train_scaled, X_test_scaled, y_train, y_test = get_train_test_split(scale = True)
In [24]:
# PCA with 95% explained variance ratio
pca = PCA(0.95)
pca.fit(X_train_scaled)
Out[24]:
PCA(n_components=0.95)
In [25]:
# Variance ratio covered by n components
cum_var = np.cumsum(pca.explained_variance_ratio_ * 100)
In [26]:
plt.plot(cum_var)
plt.axhline(y = 95, color = 'r', linestyle = 'dashed')   
plt.axvline(x = 92, color = 'g', linestyle = 'dashed')   
plt.xlabel('Number of components')
plt.ylabel('Explained Variance')
plt.yticks(sorted(list(range(0,110,10))+[95]));
plt.grid()
plt.show()
In [27]:
print(f"we need {len(pca.components_)} components")
we need 92 components

Hence, the dimension reduced from 256 to 92!!

In [28]:
# Transform the data
x_pca = pca.transform(X_train_scaled)
x_pca.shape
Out[28]:
(1444, 92)
In [29]:
nn = MLPClassifier(hidden_layer_sizes= (20,20), alpha = 0.001,max_iter=1000)
nn.fit(x_pca, y_train)

x_test_pca =  pca.transform(X_test_scaled)

print("training score is ", nn.score(x_pca,y_train))
print("test score is ",nn.score(x_test_pca,y_test) )
training score is  1.0
test score is  0.9362880886426593
In [30]:
# Confusion matrix
confusion(nn,x_test_pca,y_test)

Plot using different n components¶

We will experiment with different numbers of components.

In [31]:
n_list = np.linspace(2,X_train.shape[1], 20 ,dtype=int)
n_list
Out[31]:
array([  2,  10,  19,  28,  37,  45,  54,  63,  72,  81,  89,  98, 107,
       116, 125, 133, 142, 151, 160, 169])

We check with one hidden layer here

In [32]:
# Stores all the scores
pca_track = []
for n in n_list:
    pca = PCA(n)   
    # Fit the train data
    pca.fit(X_train_scaled)
    # Transform the train data 
    x_pca = pca.transform(X_train_scaled)
    # Simple NN model
    nn = MLPClassifier(hidden_layer_sizes= (20), alpha = 0.01,max_iter=1000)
    # Fit the model
    nn.fit(x_pca, y_train)
    # Train score
    train_score = nn.score(x_pca, y_train)
    # Test score
    x_test_pca =  pca.transform(X_test_scaled)
    test_score = nn.score(x_test_pca,y_test)
    # Append
    pca_track.append([n, train_score, test_score])
In [33]:
plt.figure(figsize = (12,8))
plt.plot(n_list, np.array(pca_track)[:,1], marker = "o", linestyle ="--", label = "train", linewidth = "0.5", color = "blue")
plt.plot(n_list, np.array(pca_track)[:,2], marker = "o", linestyle ="-", label = "test", linewidth = "1", color = "red")
plt.xticks(n_list)
plt.legend()
plt.axhline(y = 0.90, color = 'r', linestyle = 'dashed')   
plt.xlabel("n components")
plt.ylabel("classification Accuracy")
plt.title("Train vs test accuracy with different PCA n-components using 1 hidden layer")
plt.show()

Compared to the correlation approach for dimension reduction (169 columns), PCA does much better, reaching above 90% test accuracy with as few as 40 components.

We check with no hidden layer here

In [34]:
pca_track = []
for n in n_list:
    pca = PCA(n)   
    # Fit the train data
    pca.fit(X_train_scaled)
    # Transform the train data 
    x_pca = pca.transform(X_train_scaled)
    # Simple NN model
    nn = MLPClassifier(hidden_layer_sizes= (), alpha = 0.01,max_iter=1000)
    # Fit the model
    nn.fit(x_pca, y_train)
    # Train score
    train_score = nn.score(x_pca, y_train)
    # Test score
    x_test_pca =  pca.transform(X_test_scaled)
    test_score = nn.score(x_test_pca,y_test)
    # Append
    pca_track.append([n, train_score, test_score])
In [35]:
plt.figure(figsize = (12,8))
plt.plot(n_list, np.array(pca_track)[:,1], marker = "o", linestyle ="--", label = "train", linewidth = "0.5", color = "blue")
plt.plot(n_list, np.array(pca_track)[:,2], marker = "o", linestyle ="-", label = "test", linewidth = "1", color = "red")
plt.xticks(n_list+[2])
plt.legend()
plt.axhline(y = 0.90, color = 'r', linestyle = 'dashed')   
plt.xlabel("n components")
plt.ylabel("classification Accuracy")
plt.title("Train vs test accuracy with different PCA n-components using no hidden layer")
plt.show()

Even without any hidden layers (essentially a logistic regression), we obtain good performance with as few as 47 components
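An MLPClassifier with no hidden layers reduces to a single affine map followed by a sigmoid, i.e. exactly logistic regression. A minimal numpy sketch of that decision function (the weights `W`, `b` and the inputs below are made-up illustrations, not values fitted in this notebook):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps a raw score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_no_hidden(X, W, b, threshold=0.5):
    # With no hidden layer the network is just sigmoid(X @ W + b),
    # which is the logistic regression decision function
    probs = sigmoid(X @ W + b)
    return (probs >= threshold).astype(int)

# Toy example with hypothetical weights for a 2-feature input
W = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.5],   # score =  1.5 -> class 1
              [0.5, 2.0]])  # score = -1.5 -> class 0
print(predict_no_hidden(X, W, b))  # [1 0]
```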

What about random columns??

In [36]:
random_track = []
for n in n_list:
    _, mean, stand_dev = random_NN(X_train_scaled, y_train, X_test_scaled, y_test, 
                                    rand_columns_count = n, hidden_layer_sizes = (20), loop_count = 10)  
    # Append
    random_track.append([n, mean[0], mean[1]])
In [37]:
plt.figure(figsize = (12,8))
plt.plot(n_list, np.array(random_track)[:,1], marker = "o", linestyle ="--", label = "train", linewidth = "0.5", color = "blue")
plt.plot(n_list, np.array(random_track)[:,2], marker = "o", linestyle ="-", label = "test", linewidth = "1", color = "red")
plt.xticks(n_list+[2])
plt.legend()
plt.axhline(y = 0.90, color = 'r', linestyle = 'dashed')   
plt.xlabel("n components")
plt.ylabel("classification Accuracy")
plt.title("Train vs test accuracy with different random selection using one hidden layer")
plt.show()

What's the observation?

Clearly, once we start to reduce the dimension, PCA outperforms the random selection. Now, the question is if there is any way we can outperform PCA itself.

Maybe we can also ask why PCA does not work well with 10 components or with 2 components.

Can PCA classify correctly using just 2 components??¶

In [38]:
pca_2 = PCA(2)
pca_2.fit(X_train_scaled)
x_pca_2 = pca_2.transform(X_train_scaled)
In [39]:
plt.figure()
sns.scatterplot(x = x_pca_2[:,0], y = x_pca_2[:,1], hue = y_train)
Out[39]:
<Axes: >

We can see that the normal images are clustered into 3 groups. However, the anomaly images form no such clusters and are scattered around. Hence, a model cannot predict the labels using just 2 (or very few) components.

To address this, we introduce the Gaussian (normal) distribution. By fitting a Gaussian to the normal data, we can compare how far each data point lies from that distribution.

Using gaussian distance¶

Here, we use the mean and covariance to make a simple classifier.

Normal data to use for mean and cov

In [40]:
# covariance
_cov = np.cov(normal_data, rowvar=False)
# Mean
mean = np.mean(normal_data, axis=0)
# Inverse of covariance
cov_inv = np.linalg.inv(_cov)

Distance of each image from the Gaussian distribution, using the Mahalanobis distance

In [41]:
dist_normal = [mahalanobis(_img, mean, cov_inv) for _img in normal_data]
dist_anomaly = [mahalanobis(_img, mean, cov_inv) for _img in anomaly_data]
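Under the hood, `scipy.spatial.distance.mahalanobis` computes d(x) = sqrt((x − mean)ᵀ Σ⁻¹ (x − mean)). A standalone numpy check of that formula (the vectors here are illustrative, not this notebook's data):

```python
import numpy as np

def mahalanobis_manual(x, mean, cov_inv):
    # d(x) = sqrt((x - mean)^T  Sigma^{-1}  (x - mean))
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# With an identity covariance, Mahalanobis distance reduces to Euclidean distance
mean = np.zeros(3)
cov_inv = np.eye(3)          # the inverse of the identity is the identity
x = np.array([3.0, 4.0, 0.0])
print(mahalanobis_manual(x, mean, cov_inv))  # 5.0, the Euclidean norm
```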

We plot how the distances for normal and anomaly data are distributed. Normal images should have relatively smaller distances, whereas anomaly images should have larger ones.

In [42]:
plt.hist(dist_normal,
         alpha=0.5, # the transparency parameter 
         label='Normal data') 
  
plt.hist(dist_anomaly, 
         alpha=0.5, 
         label='Anomaly data') 
  
plt.legend(loc='upper right') 
plt.title('Histogram of normal and anomaly data') 
plt.xlabel("Mahalanobis distance from Gaussian distribution")
plt.ylabel("Frequency")
plt.show()

We can see that there is an overlapping region. Hence, a simple classifier will not be able to separate the classes using a single distance threshold.
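To see why a single threshold has a ceiling, here is a small sketch that sweeps candidate thresholds over two overlapping (synthetic) distance samples and reports the best achievable accuracy. The distributions below are made up for illustration, not the notebook's actual distances:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two distance distributions; they overlap
dist_normal = rng.normal(loc=12.0, scale=3.0, size=300)
dist_anomaly = rng.normal(loc=18.0, scale=4.0, size=100)

def best_threshold_accuracy(d_norm, d_anom):
    dists = np.concatenate([d_norm, d_anom])
    labels = np.concatenate([np.zeros(len(d_norm)), np.ones(len(d_anom))])
    # Try every observed distance (plus +inf) as a cut: predict anomaly above it
    candidates = np.concatenate([np.sort(dists), [np.inf]])
    best = 0.0
    for t in candidates:
        preds = (dists >= t).astype(float)  # 1 = anomaly
        best = max(best, float(np.mean(preds == labels)))
    return best

acc = best_threshold_accuracy(dist_normal, dist_anomaly)
print(round(acc, 3))  # strictly below 1.0 because the distributions overlap
```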

Detect outlier¶

Find the lower and upper bound to find the outlier

In [43]:
q1, q3= np.percentile(dist_normal,[25,75])
iqr = q3 - q1
lower_bound = q1 -(1.5 * iqr) 
upper_bound = q3 +(1.5 * iqr) 
print("lower bound is ", lower_bound)
print("upper bound is ", upper_bound)
lower bound is  9.04635286010004
upper bound is  22.595015689220833
In [44]:
def is_normal(distance, lower_bound = lower_bound, upper_bound = upper_bound):
    return lower_bound < distance < upper_bound
    

For normal data

In [45]:
correct = 0
incorrect = 0
for _dist in dist_normal:
    is_norm = is_normal(_dist)
    if is_norm:
        correct+=1
    else:
        incorrect+=1

normal_classification_accuracy =  round(correct/(correct + incorrect), 4)

For anomaly data

In [46]:
correct = 0
incorrect = 0
for _dist in dist_anomaly:
    is_norm = is_normal(_dist)
    if is_norm:
        incorrect+=1
    else:
        correct+=1
        
anomaly_classification_accuracy =  round(correct/(correct + incorrect), 4)
In [47]:
bar_labels = ["normal data", "anomaly data"]
bar_vals = [normal_classification_accuracy, anomaly_classification_accuracy]
plt.bar(bar_labels,bar_vals)
for i,j in zip(bar_labels, bar_vals):
    plt.text(i,j, str(j))
plt.xlabel("data class")
plt.ylabel(" Classification accuracy")
plt.title("Comparison of classification accuracy using Mahalanobis distance")
Out[47]:
Text(0.5, 1.0, 'Comparison of classification accuracy using Mahalanobis distance')

So, this simple model is not a good one.
A possible issue:

  • we are using all the features, which may add noise to the distance computation

Use PCA transformation¶

In [48]:
pca = PCA(0.95)
pca.fit(normal_data)
components_count = len(pca.components_)
components_count
Out[48]:
52
In [49]:
_normal_pca = pca.transform(normal_data)
_anomaly_pca = pca.transform(anomaly_data)

_cov = np.cov(_normal_pca, rowvar=False)
mean = np.mean(_normal_pca, axis=0)
cov_inv = np.linalg.inv(_cov)

dist_normal = [mahalanobis(_img, mean, cov_inv) for _img in _normal_pca]
dist_anomaly = [mahalanobis(_img, mean, cov_inv) for _img in _anomaly_pca]
In [50]:
plt.hist(dist_normal,
         alpha=0.5, # the transparency parameter 
         label='Distance distribution of normal data') 
  
plt.hist(dist_anomaly, 
         alpha=0.5, 
         label='Distance distribution of anomaly data') 
  
plt.legend(loc='upper right') 
plt.legend(loc='upper right') 
plt.title('Histogram of normal and anomaly distances after PCA') 
plt.xlabel("Mahalanobis distance from Gaussian distribution")
plt.ylabel("Frequency")
plt.show()

Just from the histogram, we can see that the distributions overlap even more after applying PCA.

Maybe we cannot classify the rows based on a single distance alone. So, we need a multi-dimensional approach to make a better classification.

Using z-score for feature transformation¶

We can consider the mean of each column of the normal data as a mean representation of normal images; together with the per-column standard deviation, this defines a Gaussian for each feature.
We then convert each row into z-score form using that mean and standard deviation.

In [51]:
from scipy import stats
np.set_printoptions(suppress=False)
In [52]:
# Find mean and covariance and standard deviation
_cov = np.cov(normal_data, rowvar=False)
mean = np.mean(normal_data, axis=0)
stand_dev = np.sqrt(np.diag(_cov))

Converts the array into z-score form

In [53]:
def z_transform(arr,mean,sd):
    # Difference with mean (x - x_bar)
    diff = (arr - mean)
    # Divide by standard deviation (x-x_bar)/sd
    z_score = diff/sd
    return z_score
In [54]:
# Transform the normal and anomaly data
normal_z_z = z_transform(normal_data, mean, stand_dev)
anomaly_z_z= z_transform(anomaly_data, mean, stand_dev)

Here, we check the histograms of a few columns after the z-transform

In [55]:
fig, ax = plt.subplots(10,2, figsize=(8,40))
indices =random.sample(range(255),10)
for i,t in enumerate(indices):
    ax[i][0].hist(normal_z_z[:,t])
    ax[i][1].hist(anomaly_z_z[:,t])
    ax[i][0].set_title(f"col_{t} normal")
    ax[i][1].set_title(f"col_{t} anomaly")

A few assumptions:

  • normal images should have relatively smaller z-scores
  • if the normal and anomaly columns have similar z-score distributions, then that column does not capture the anomaly
  • if the normal and anomaly columns have different z-score distributions, then that column differentiates normal from anomaly

Preparing the dataset

In [56]:
X = np.concatenate((normal_z_z, anomaly_z_z), axis=0)
y = [0] * normal_z_z.shape[0] + [1] * anomaly_z_z.shape[0]

X_train, X_test, y_train, y_test = get_train_test_split(X =X, y = y)
In [57]:
exp_variance = 0.9
pca = PCA(exp_variance)
pca.fit(X_train)

print(f"n components is {len(pca.components_)}")
n components is 53
In [58]:
cum_var = np.cumsum(pca.explained_variance_ratio_ * 100)


plt.plot(cum_var)
plt.axhline(y = exp_variance*100, color = 'r', linestyle = 'dashed')   
plt.axvline(x = len(pca.components_), color = 'g', linestyle = 'dashed')   
plt.xlabel('Number of components')
plt.ylabel('Explained Variance')
plt.yticks(sorted(list(range(0,110,10))+[95]));
plt.grid()
plt.show()
In [59]:
train_pca = pca.transform(X_train)
test_pca = pca.transform(X_test)

nn = MLPClassifier(hidden_layer_sizes= (20,20), alpha = 0.01,max_iter=1000)
nn.fit(train_pca, y_train)
print("Training score is", nn.score(train_pca, y_train))
print("Test score is", nn.score(test_pca,y_test))
Training score is 1.0
Test score is 0.9141274238227147

Feature selection using PCA¶

Another approach is selecting the top-k columns contributing to each principal component. This way we get actual columns rather than PCA-transformed features.

In [60]:
# returns the indices of the top k values of a given array (here, a principal component's loadings)
def top_k(a,k):
    return sorted(range(len(a)), key=lambda i: a[i], reverse= True)[:k]
In [61]:
# We take the single most important column of each principal component
most_important = [top_k(np.abs(pca.components_[i]),1) for i in range(len(pca.components_))]
most_important = sum(most_important,[])
most_important = list(set(most_important))
most_important = sorted(most_important)
len(most_important)
Out[61]:
43

Use the columns obtained above

In [62]:
X_train, X_test, y_train, y_test = get_train_test_split(X = X[:,most_important], y = y)

nn = MLPClassifier(hidden_layer_sizes= (20, 20), alpha = 0.01,max_iter=1000)
nn.fit(X_train, y_train)
print("Training score is", nn.score(X_train, y_train))
print("Test score is", nn.score(X_test,y_test))
Training score is 1.0
Test score is 0.8614958448753463

If we select 43 random columns, will it do worse??¶

In [63]:
rand_cols = random.sample(range(255),len(most_important))
rand_cols = sorted(rand_cols)
In [64]:
X_train, X_test, y_train, y_test = get_train_test_split(X = X[:,rand_cols], y = y)

nn = MLPClassifier(hidden_layer_sizes= (20, 20), alpha = 0.01,max_iter=2000)
nn.fit(X_train, y_train)
print("Training score is", nn.score(X_train, y_train))
print("Test score is", nn.score(X_test,y_test))
Training score is 1.0
Test score is 0.8698060941828255

Is it the transformation??¶

Here, we repeat the experiment using the same random columns but from the original dataset (without the z-score transformation)

In [65]:
X = np.concatenate((normal_data, anomaly_data), axis=0)
y = [0] * normal_data.shape[0] + [1] * anomaly_data.shape[0]
X_train, X_test, y_train, y_test = get_train_test_split(X = X[:,rand_cols], y = y)

nn = MLPClassifier(hidden_layer_sizes= (20, 20), alpha = 0.01,max_iter=1000)
nn.fit(X_train, y_train)
print("Training score is", nn.score(X_train, y_train))
print("Test score is", nn.score(X_test,y_test))
Training score is 0.8864265927977839
Test score is 0.8919667590027701

So, using random columns from the transformed data, we get better performance than using the same columns from the original data.

So, the transformation is doing the work. However, we should also try plain standard scaling and check whether it gives the same benefit.

Random but with scaled data¶

In [66]:
# Applying standard scaling
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
In [67]:
nn = MLPClassifier(hidden_layer_sizes= (20, 20), alpha = 0.01,max_iter=1000)
nn.fit(X_train_scaled, y_train)
print("Training score is", nn.score(X_train_scaled, y_train))
print("Test score is", nn.score(X_test_scaled,y_test))
Training score is 1.0
Test score is 0.8836565096952909

It turns out that even with plain standard scaling, we get roughly the same performance as with the z-score transformation.

To get the big picture, we will compare different sizes of neural networks with different column selections and transformations.

Experimenting for multiples scenarios¶

Compare random picks from the regular, standard-scaled, z-transformed, and PCA-z-transformed data

Data without any transformation

In [68]:
X_train, X_test, y_train, y_test = get_train_test_split(rand_int = 100)

Standard Scaling on regular data

In [69]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Z-score transformation

In [70]:
X = np.concatenate((normal_z_z, anomaly_z_z), axis=0)
y = [0] * normal_z_z.shape[0] + [1] * anomaly_z_z.shape[0]

X_train_Z, X_test_Z, y_train_Z, y_test_Z = get_train_test_split(X= X, y = y, rand_int = 100)

Scaling the Z-score data

In [71]:
scaler_Z = StandardScaler()
scaler_Z.fit(X_train_Z)
X_train_scaled_Z = scaler_Z.transform(X_train_Z)
X_test_scaled_Z = scaler_Z.transform(X_test_Z)

Warning: this cell has a long runtime

In [72]:
# Stores everything 
results_dict = {}
hidden_layer_sizes = [(20,20),(20,),(5,),(1,),()]
variance_percent = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4]

for hidden_layer_size in tqdm(hidden_layer_sizes):
    pca_track = []
    pca_normal = []
    random_track_Z = []
    random_track_Z_scaled = []
    random_track = []
    random_scaled = []
    for _percent in tqdm(variance_percent):
        # Create pca transform
        pca = PCA(_percent)   
        # Fit the train data
        pca.fit(X_train_Z)
        components_count = len(pca.components_)
        # Transform the train data 
        x_pca = pca.transform(X_train_Z)
        # Simple NN model
        nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
        # Fit the model
        nn.fit(x_pca, y_train_Z)
        # Train score
        train_score = nn.score(x_pca, y_train_Z)
        # Test score
        x_test_pca =  pca.transform(X_test_Z)
        test_score = nn.score(x_test_pca,y_test_Z)
        # Append
        pca_track.append([components_count,train_score, test_score])
        
        # PCA with normal data
        pca = PCA(components_count)   
        # Fit the train data
        pca.fit(X_train_scaled)
        # Transform the train data 
        x_pca = pca.transform(X_train_scaled)
        # Simple NN model
        nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
        # Fit the model
        nn.fit(x_pca, y_train)
        # Train score
        train_score = nn.score(x_pca, y_train)
        # Test score
        x_test_pca =  pca.transform(X_test_scaled)
        test_score = nn.score(x_test_pca,y_test)
        # Append
        pca_normal.append([components_count, train_score, test_score])

        random.seed(2)
        # Random cols
        rand_cols = random.sample(range(255),components_count)
        rand_cols = sorted(rand_cols)

        # Random pick from z-score data
        # NN model
        nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
        # Train the model
        nn.fit(X_train_Z[:,rand_cols], y_train_Z)
        # Train score
        train_score = nn.score(X_train_Z[:,rand_cols], y_train_Z)
        # test score
        test_score = nn.score(X_test_Z[:,rand_cols],y_test_Z)
        # Append
        random_track_Z.append([components_count, train_score, test_score])

        # Random pick from scaled z-score data
        # NN model
        nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
        # Train the model
        nn.fit(X_train_scaled_Z[:,rand_cols], y_train_Z)
        # Train score
        train_score = nn.score(X_train_scaled_Z[:,rand_cols], y_train_Z)
        # test score
        test_score = nn.score(X_test_scaled_Z[:,rand_cols],y_test_Z)
        # Append
        random_track_Z_scaled.append([components_count, train_score, test_score])

        # Random pick from regular data
        # NN model
        nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
        # Train the model
        nn.fit(X_train[:,rand_cols], y_train)
        # Train score
        train_score = nn.score(X_train[:,rand_cols], y_train)
        # test score
        test_score = nn.score(X_test[:,rand_cols],y_test)
        # Append
        random_track.append([components_count, train_score, test_score])

        # Random pick from standard scaled regular data 
        # NN model
        nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
        # Train on scaled data
        nn.fit(X_train_scaled[:,rand_cols], y_train)
        train_score = nn.score(X_train_scaled[:,rand_cols], y_train)
        test_score = nn.score(X_test_scaled[:,rand_cols],y_test)
        random_scaled.append([components_count, train_score, test_score])
        results_dict[hidden_layer_size] = [pca_track, pca_normal, random_track_Z, random_track_Z_scaled, random_track, random_scaled ]

Plotting the result¶

In [73]:
size0 = "0.5"
size1 = "1"
size2 = "2"
for key in results_dict.keys():
    pca_track,pca_normal, random_track_Z, random_track_Z_scaled, random_track, random_scaled = results_dict[key]
    n = np.array(random_track)[:,0]

    plt.figure(figsize = (12,10))
    # PCA
    plt.plot(n, np.array(pca_track)[:,1], marker = ".", linestyle ="--", label = "pca on Z train", linewidth = size0, color = "red")
    plt.plot(n, np.array(pca_track)[:,2], marker = ".", linestyle ="-", label = "pca on Z test", linewidth = size2, color = "red")
    # Column selection using PCA
    plt.plot(n, np.array(pca_normal)[:,1], marker = ".", linestyle ="--", label = "pca on original data train", linewidth = size0, color = "purple")
    plt.plot(n, np.array(pca_normal)[:,2], marker = ".", linestyle ="-", label = "pca on original data test", linewidth = size2, color = "purple")
    # Random selection from z-score
    plt.plot(n, np.array(random_track_Z)[:,1], marker = "*", linestyle ="--", label = "random Z train", linewidth = size0, color = "green")
    plt.plot(n, np.array(random_track_Z)[:,2], marker = "*", linestyle ="-", label = "random Z test", linewidth = size1, color = "green")
    # Random selection from scaled z-score
    plt.plot(n, np.array(random_track_Z_scaled)[:,1], marker = "s", linestyle ="--", label = "random scaled Z train", linewidth = size0, color = "orange")
    plt.plot(n, np.array(random_track_Z_scaled)[:,2], marker = "s", linestyle ="-", label = "random scaled Z test", linewidth = size1, color = "orange")
    # Random selection from regular
    plt.plot(n, np.array(random_track)[:,1], marker = "o", linestyle ="--", label = "random train", linewidth = size0, color = "blue")
    plt.plot(n, np.array(random_track)[:,2], marker = "o", linestyle ="-", label = "random test", linewidth = size1, color = "blue")
    # Random selection from standard scalar
    plt.plot(n, np.array(random_scaled)[:,1], marker = "^", linestyle ="--", label = "random scaled train", linewidth = size0, color = "black")
    plt.plot(n, np.array(random_scaled)[:,2], marker = "^", linestyle ="-", label = "random scaled test", linewidth = size1, color = "black")
    plt.legend()
    plt.axhline(y = 0.90, color = 'r', linestyle = 'dashed', linewidth = "0.5")   
#     plt.axvline(x = 30, color = 'r', linestyle = 'dashed', linewidth = "0.5")   
    plt.xlabel("n components")
    plt.ylabel("Classification accuracy")
    # plt.xticks(n);
    plt.title(f"Random vs PCA on original and transformed columns with NN shape {key}")

What's the observation??¶

From the plot we can see that PCA on z-transformed data and PCA on the original data give similar performance, both much better than the random selection approaches. Hence, we can skip the z-transformation and simply apply PCA to the original data.

Another observation is that the number of hidden layers does not influence the performance; the last plot, with no hidden layer, shows this. So, the best way forward is the simplest model of all: logistic regression.

Using simple Logistic Regression¶

In [74]:
X_train_scaled, X_test_scaled, y_train, y_test = get_train_test_split(rand_int = 10, scale = True)

clf = LogisticRegression(max_iter = 2000).fit(X_train_scaled, y_train)
clf.score(X_test_scaled, y_test)
Out[74]:
0.9556786703601108
In [75]:
# Dictionary of column and its coefficient
coef_dict = dict(zip(df.columns[:-1],clf.coef_[0]))
# Ordering the dictionary based on coefficient value
coef_dict = {k: v for k, v in sorted(coef_dict.items(), key=lambda item: abs(item[1]), reverse=True)}
In [76]:
n_col = 10
new_cols_names = list(coef_dict.keys())[:n_col]
new_cols_names
Out[76]:
['c211', 'c99', 'c50', 'c37', 'c83', 'c212', 'c22', 'c46', 'c144', 'c121']
In [77]:
title = "Coefficient weights"
x = list(coef_dict.keys())[:n_col]
y = list(coef_dict.values())[:n_col]
plt.scatter(x,y)
for i, label in enumerate(x):
    plt.annotate(str(label), (x[i],y[i]) , rotation=60,size=8)
plt.xticks([]) # Hide x-axis ticks; the points are annotated instead
plt.grid()
plt.title(title, {'fontsize': 15},pad=30 )
plt.show()
In [78]:
# Dictionary of column and its coefficient
coef_dict = dict(zip(range(0,X_train_scaled.shape[1]),clf.coef_[0]))
# Ordering the dictionary based on coefficient value
coef_dict = {k: v for k, v in sorted(coef_dict.items(), key=lambda item: abs(item[1]), reverse=True)}

Comparisons for different column counts

In [79]:
n_cols = [50,45,40,35,30,25,20,15,10,5]
pca_scores = []
lr_coe_scores = []
for n_col in n_cols:
    # Slice the columns
    new_cols_index = list(coef_dict.keys())[:n_col]

    # PCA with normal data
    pca = PCA(n_col)
    pca_score = []
    lr_score = []
    # We store average score from 10 random train test split
    for i in np.random.randint(10000,size = 10):
        X_train_scaled, X_test_scaled, y_train, y_test = get_train_test_split(rand_int = i, scale = True)
        # Fit the train data
        pca.fit(X_train_scaled)
        # Transform the train data 
        x_pca = pca.transform(X_train_scaled)
        # Simple logistic regression model
        lr = LogisticRegression()
        # Fit the model
        lr.fit(x_pca, y_train)
        # Train score
        train_score = lr.score(x_pca, y_train)
        # Test score
        x_test_pca =  pca.transform(X_test_scaled)
        test_score = lr.score(x_test_pca,y_test)
        pca_score.append([train_score, test_score])
        
        # Logistic regression using columns selected by coefficient magnitude
        clf = LogisticRegression().fit(X_train_scaled[:,new_cols_index], y_train)
        train_score = clf.score(X_train_scaled[:,new_cols_index], y_train)
        test_score = clf.score(X_test_scaled[:,new_cols_index], y_test)
        lr_score.append([train_score, test_score])
    # Store the average values 
    train_score, test_score = np.array(pca_score).mean(axis = 0)
    pca_scores.append([n_col,train_score, test_score])

    train_score, test_score = np.array(lr_score).mean(axis = 0)    
    lr_coe_scores.append([n_col,train_score, test_score])
In [80]:
n = np.array(pca_scores)[:,0]
plt.plot(n, np.array(pca_scores)[:,1], marker = ".", linestyle ="--", label = "PCA LR train", linewidth = "0.5", color = "purple")
plt.plot(n, np.array(pca_scores)[:,2], marker = ".", linestyle ="-", label = "PCA LR test", linewidth = "1", color = "purple")

plt.plot(n, np.array(lr_coe_scores)[:,1], marker = "o", linestyle ="--", label = "LR coefficient selected train", linewidth = "0.5", color = "blue")
plt.plot(n, np.array(lr_coe_scores)[:,2], marker = "o", linestyle ="-", label = "LR coefficient selected test", linewidth = "1", color = "blue")
plt.axhline(y = 0.9, color = 'black', linestyle = 'dashed', linewidth = "0.5")   

plt.legend()
Out[80]:
<matplotlib.legend.Legend at 0x1fc94749790>

So, on average we can get 90% classification accuracy with just 40 columns.

In [81]:
print(list(coef_dict.keys())[:40])
[211, 99, 50, 37, 83, 212, 22, 46, 144, 121, 122, 206, 215, 169, 123, 138, 58, 172, 0, 132, 103, 240, 194, 24, 4, 192, 202, 35, 207, 146, 125, 231, 55, 143, 184, 49, 128, 181, 8, 23]

These are the column indices in the dataframe!!

Accuracy, Recall, Precision and F1 score of the model for each class¶

In [82]:
# We do cross-validation on whole data set
X = np.concatenate((normal_data, anomaly_data), axis=0)
y = [0] * normal_data.shape[0] + [1] * anomaly_data.shape[0]
# Scaling
scaler = StandardScaler()
scaler.fit(X)
# Transform
X = scaler.transform(X)
# Select the columns
new_cols_index = list(coef_dict.keys())[:40]
model = LogisticRegression()

Accuracy

In [83]:
# For anomaly data

is_anomaly = (np.array(y) == 1)
cross_val_score(model, X, is_anomaly,
                cv=10, scoring="accuracy").mean()
Out[83]:
0.9069275629220381
In [84]:
# For normal data

is_anomaly = (np.array(y) == 0)
cross_val_score(model, X, is_anomaly,
                cv=10, scoring="accuracy").mean()
Out[84]:
0.9069275629220381

Recall

In [85]:
# For anomaly data

is_anomaly = (np.array(y) == 1)
cross_val_score(model, X, is_anomaly,
                cv=10, scoring="recall").mean()
Out[85]:
0.7545454545454546
In [86]:
# For normal data

is_anomaly = (np.array(y) == 0)
cross_val_score(model, X, is_anomaly,
                cv=10, scoring="recall").mean()
Out[86]:
0.9560057964791756

Precision

In [87]:
# For anomaly data

is_anomaly = (np.array(y) == 1)
cross_val_score(model, X, is_anomaly,
                cv=10, scoring="precision").mean()
Out[87]:
0.8519167426883248
In [88]:
# For normal data

is_anomaly = (np.array(y) == 0)
cross_val_score(model, X, is_anomaly,
                cv=10, scoring="precision").mean()
Out[88]:
0.9274925113227367

F1 score

In [89]:
# For anomaly data

is_anomaly = (np.array(y) == 1)
cross_val_score(model, X, is_anomaly,
                cv=10, scoring="f1").mean()
Out[89]:
0.7830255596091451
In [90]:
# For normal data

is_anomaly = (np.array(y) == 0)
cross_val_score(model, X, is_anomaly,
                cv=10, scoring="f1").mean()
Out[90]:
0.940144619716649

There are a few reasons for this discrepancy, mainly in recall. First, an anomaly can occur in many different ways, while normality has essentially one form. On top of that, the data is imbalanced, with fewer anomaly examples. This leads to a high number of false negatives for the anomaly class.
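The per-class metrics above follow directly from the confusion counts. A minimal numpy sketch of recall and precision, using toy predictions (illustrative, not this notebook's model output) that show how missed anomalies hurt recall but not precision:

```python
import numpy as np

def recall_precision(y_true, y_pred, positive=1):
    # recall    = TP / (TP + FN): of all true positives, how many did we catch?
    # precision = TP / (TP + FP): of all predicted positives, how many were right?
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    return tp / (tp + fn), tp / (tp + fp)

# Imbalanced toy labels: 8 normal (0), 4 anomaly (1); the model misses 2 anomalies
y_true = [0]*8 + [1]*4
y_pred = [0]*8 + [1, 1, 0, 0]
r, p = recall_precision(y_true, y_pred)
print(r, p)  # recall 0.5, precision 1.0
```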

Conclusion¶

We compared the performance of different feature-reduction approaches, including correlation, PCA, random selection, and regression coefficients. We also applied feature transformations such as scaling and z-score transformation, and tried a simple classifier that detects outliers using a Gaussian distribution.

After these analyses, we conclude that even for a complex raw image dataset, if we use a pretrained layer for feature extraction, a simple logistic regression binary classifier is enough to separate normal and anomaly images. There is no need for a complex CNN or a deep neural network trained from scratch.

At the same time, we can reduce the column count to as few as 35 out of 256 columns and still get an average test score of 90%!!
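As a recap, the whole recipe fits in a few lines: scale, rank columns by logistic regression coefficient magnitude, and refit on the top subset. This sketch uses a synthetic matrix standing in for the activation vectors (the data, the 40-column cutoff, and all numbers here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic stand-in for the 256-dim activation vectors (not the real dataset)
X, y = make_classification(n_samples=500, n_features=256, n_informative=40,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, then fit a full logistic regression to get coefficients
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=2000)
clf.fit(scaler.transform(X_train), y_train)

# Rank columns by absolute coefficient and refit on the top 40 only
top40 = np.argsort(np.abs(clf.coef_[0]))[::-1][:40]
clf_small = LogisticRegression(max_iter=2000)
clf_small.fit(scaler.transform(X_train)[:, top40], y_train)
score = clf_small.score(scaler.transform(X_test)[:, top40], y_test)
print(round(score, 3))
```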

This approach is also more explainable, since we can now inspect each selected channel in the pretrained layer and see which particular kernel activates, giving a better understanding of the anomalies.